Importing Libraries

In [52]:
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from statsmodels.graphics.tsaplots import plot_acf
import seaborn as sns
import scipy.stats as stats
import bamboolib as bam
import statsmodels.api as sm

Reading Dataset

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/JaiminBrahmbhatt/MidsemQue/main/data.csv')
data

Note: all graphs are interactive, so feel free to play with the data visualization

1. Determine what the distribution of each datum is and give the appropriate parameters for that distribution:

  1. Normal

  2. Continuous Uniform

  3. None of these (Neither)

So what is normally distributed data?

A normal distribution is a common probability distribution. It has a shape often referred to as a "bell curve."

Many everyday data sets typically follow a normal distribution: for example, the heights of adult humans, the scores on a test given to a large class, errors in measurements.

The normal distribution is always symmetrical about the mean.

The standard deviation is the measure of how spread out a normally distributed set of data is. It is a statistic that tells you how closely all of the examples are gathered around the mean in a data set. The shape of a normal distribution is determined by the mean and the standard deviation. The steeper the bell curve, the smaller the standard deviation. If the examples are spread far apart, the bell curve will be much flatter, meaning the standard deviation is large.

Reference - https://www.varsitytutors.com/hotmath/hotmath_help/topics/normal-distribution-of-data#:~:text=The%20shape%20of%20a%20normal,the%20standard%20deviation%20is%20large.
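To make the "parameters" idea concrete, here is a minimal sketch on synthetic numpy data (not the assignment's dataset): a normal distribution is fully described by its mean and standard deviation, and both can be recovered from a large enough sample.

```python
import numpy as np

# draw a large sample from a normal distribution with known parameters
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100_000)

# the sample statistics recover the two parameters that define the bell curve
print(sample.mean())   # close to 5.0
print(sample.std())    # close to 2.0
```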

To determine which distribution fits, we can either plot the values' frequencies by dividing them into small bins (histograms) or use kernel density estimation (KDE).

We start by calculating the mean, standard deviation and z-score.

In [4]:
data
In [5]:
data.mean()
Out[5]:
x   -0.526923
y   -0.023169
z    0.511323
w   -0.005269
dtype: float64
In [6]:
data.std()
Out[6]:
x    4.646678
y    6.683136
z    4.966079
w    6.345948
dtype: float64

Calculating the Z-score of the whole dataset along axis = 0 (i.e. per individual column)

In [7]:
datazscore = data.apply(stats.zscore)
datazscore

Here the z-scores for column y come out as NaN (with scipy's default nan_policy, a single missing value propagates through the column's mean and standard deviation), which implies that column y has a missing value
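Note that with scipy's default nan_policy, a missing value actually propagates as NaN through the whole column's z-scores; a toy illustration (not the assignment data):

```python
import numpy as np
from scipy import stats

# a toy column containing one missing value
col = np.array([1.0, 2.0, np.nan, 4.0])

# the NaN contaminates the column mean and std, so every z-score becomes NaN
z = stats.zscore(col)
print(z)
```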

In [8]:
data.isnull().sum()
Out[8]:
x    0
y    1
z    0
w    0
dtype: int64

It is better to replace the missing value; here I fill it with the column mean, since there is only 1 missing value in the dataframe

In [9]:
data['y'] = data['y'].fillna(data['y'].mean())

Again calculating the Z-score of the whole dataset along axis = 0 (i.e. per individual column)

In [10]:
datazscore = data.apply(stats.zscore)
datazscore
In [11]:
px.box(data, points='all', template='presentation')

Plot based on KDE

Plotting the data using kernel density estimation (KDE) shows the actual distribution of the values in each column

In [12]:
ax = data.plot.kde()

From these distributions we can say that,

w = neither normal nor continuous uniform

x = neither normal nor continuous uniform

y = normal distribution

z = normal distribution

In [13]:
px.histogram(datazscore, x='w'  ,template='ggplot2')

Column W seems to be neither normally distributed nor continuous uniform

In [14]:
px.histogram(datazscore, x='x'  ,template='ggplot2')

Column X seems to be neither normally distributed nor continuous uniform

In [15]:
px.histogram(datazscore, x='y'  ,template='ggplot2')

Column Y seems to be normally distributed

In [16]:
px.histogram(datazscore, x='z'  ,template='ggplot2')

Column Z seems to be normally distributed

2. Determine if the datum is periodic, and if so, what its frequency is.

We start by plotting the data for each of the columns w, x, y and z

In [17]:
px.line(data , template='presentation')

To better visualize everything, each column is plotted individually to get a clearer picture

In [48]:
px.line(y=data['w'], template='presentation' , title='Plot for Column W')

Looking at the graph we can directly say that there is some sort of periodicity in column W.

The series takes about 100 instances to complete one cycle. Sampled at a regular interval of 30 Hz, that is a period of about 3.33 seconds (a frequency of 0.3 Hz); in other words, it repeats itself every 100 instances. In total, this series contains 5 cycles.

In [64]:
sm.graphics.tsa.plot_acf(data['w'], lags=100)
plt.show()
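The frequency read off the plot can also be cross-checked with an FFT. This is a minimal sketch on a synthetic 0.3 Hz sine sampled at the assumed 30 Hz rate; the same approach would apply to data['w']:

```python
import numpy as np

fs = 30.0                                  # assumed 30 Hz sampling rate
n = 500                                    # 5 cycles of 100 samples each
t = np.arange(n) / fs
sig = np.sin(2 * np.pi * 0.3 * t)          # synthetic wave with a 100-sample period

# find the dominant frequency from the magnitude spectrum
spectrum = np.abs(np.fft.rfft(sig))
freqs = np.fft.rfftfreq(n, d=1 / fs)
peak_hz = freqs[spectrum[1:].argmax() + 1]  # skip the DC bin

print(peak_hz)   # about 0.3 Hz -> period of ~3.33 s, i.e. 100 samples per cycle
```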
In [19]:
px.line(y=data['x'], template='presentation',  title='Plot for Column X')

Looking at the graph, we cannot say whether column X has any periodicity

In [20]:
px.line(y=data['y'], template='presentation',  title='Plot for Column Y')

Looking at the graph, we cannot say whether column Y has any periodicity

In [21]:
px.line(y=data['z'], template='presentation',  title='Plot for Column Z')

Looking at the graph, we cannot say whether column Z has any periodicity

3. Perform a Spearman and Pearson Correlation on each pair of datums and determine if any of the variables are correlated. Provide the correlation values.

Performing Spearman Correlation on each pair of datums

In [28]:
spearmancorr = data.corr(method='spearman')
spearmancorr
In [29]:
sns.heatmap(spearmancorr, 
            xticklabels=spearmancorr.columns,
            yticklabels=spearmancorr.columns,
            cmap='RdBu_r',
            annot=True,
            linewidth=0.5)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dc0a381280>

Using Spearman correlation we can say that

X --> Y has a moderate positive relationship

Y --> Z has a moderate positive relationship

Performing Pearson Correlation on each pair of datums

In [32]:
pearsoncorr = data.corr(method='pearson')
pearsoncorr
In [33]:
sns.heatmap(pearsoncorr, 
            xticklabels=pearsoncorr.columns,
            yticklabels=pearsoncorr.columns,
            cmap='RdBu_r',
            annot=True,
            linewidth=0.5)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dc0b560b20>

Using Pearson correlation we can say that

X --> Y has a moderate positive relationship

Y --> Z has a moderate positive relationship
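The difference between the two coefficients is easy to demonstrate on toy data (not the assignment's): Spearman works on ranks and captures any monotonic relationship, while Pearson captures only the linear part.

```python
import numpy as np
import pandas as pd

# a monotonic but nonlinear relation
toy = pd.DataFrame({'a': np.arange(1.0, 11.0)})
toy['b'] = toy['a'] ** 3

pearson = toy.corr(method='pearson').loc['a', 'b']
spearman = toy.corr(method='spearman').loc['a', 'b']

# Spearman ranks the values first, so any monotonic relation scores exactly 1;
# Pearson measures only linearity and comes out lower
print(spearman, pearson)
```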

4. Produce a scatter plot of each pair and a histogram of each variable.

In [34]:
px.scatter(data, x='w', y='x', template='presentation' ,  title='Scatter Plot W vs X')
In [35]:
px.scatter(data, x='w', y='y' ,template='presentation'  ,title='Scatter Plot W vs Y')
In [36]:
px.scatter(data, x='w', y='z'  ,template='presentation' ,title='Scatter Plot W vs Z')
In [37]:
px.scatter(data, x='x', y='y'  ,template='presentation' ,title='Scatter Plot X vs Y')
In [38]:
px.scatter(data, x='x', y='z'  ,template='presentation' ,title='Scatter Plot X vs Z')
In [39]:
px.scatter(data, x='y', y='z'  ,template='presentation' ,title='Scatter Plot Y vs Z')

5. Perform autocorrelation on each variable.

In [41]:
data['x'].autocorr()
Out[41]:
0.05136999989773895

There seems to be very weak to no correlation (pattern) for datum X

In [42]:
data['y'].autocorr()
Out[42]:
-0.02674083959141378

There seems to be a very weak, slightly negative correlation (pattern) for datum Y

In [43]:
data['z'].autocorr()
Out[43]:
-0.10454861857811215

There seems to be very weak to no correlation (pattern) for datum Z

In [44]:
data['w'].autocorr()
Out[44]:
0.9979852033718756

It shows a very strong correlation (pattern) for datum W
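Series.autocorr measures how similar a series is to a lagged copy of itself. A small sketch on a synthetic periodic series (mirroring the behavior seen in column w):

```python
import numpy as np
import pandas as pd

# synthetic periodic series with a 100-sample period (not the assignment data)
s = pd.Series(np.sin(2 * np.pi * np.arange(500) / 100))

lag1 = s.autocorr(lag=1)      # pandas' lag-1 autocorrelation
manual = s.corr(s.shift(1))   # the same thing spelled out: correlate with a shifted copy

# a smooth periodic signal is nearly identical to itself one step later,
# so its lag-1 autocorrelation is close to 1
print(lag1, manual)
```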

Visualization of Autocorrelation

In [45]:
from statsmodels.graphics.tsaplots import plot_acf
col = data.columns
for i in col:
    plot_acf(data[i])
    print(i)
    plt.show()
x
y
z
w
In [ ]: